18 research outputs found

Design and update of a classification system : the UCSD map of science

Author: Biberstine Joseph R.
Boyack Kevin W.
Börner Katy
Klavans Richard
Larivière Vincent
Light Robert P.
Patek Michael
Zoss Angela M.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 12/07/2012

Global maps of science can be used as a reference system to chart career trajectories, the location of emerging research frontiers, or the expertise profiles of institutes or nations. This paper details the data preparation, analysis, and layout performed when designing and subsequently updating the UCSD map of science and classification system. The original classification and map use 7.2 million papers and their references from Elsevier’s Scopus (about 15,000 source titles, 2001–2005) and Thomson Reuters’ Web of Science (WoS) Science, Social Science, and Arts & Humanities Citation Indexes (about 9,000 source titles, 2001–2004), for a total of about 16,000 unique source titles. The updated map and classification add six years (2005–2010) of WoS data and three years (2006–2008) of Scopus data to the existing category structure, increasing the number of source titles to about 25,000. To our knowledge, this is the first time that a widely used map of science has been updated. A comparison of the original 5-year and the new 10-year maps and classification systems shows (i) an increase of 9,409 in the total number of journals that can be mapped (an 80% increase in the social sciences, 119% in the humanities, 32% in the medical sciences, and 74% in the natural sciences), (ii) a simplification of the map by assigning all but five highly interdisciplinary journals to exactly one discipline, (iii) a more even distribution of journals over the 554 subdisciplines and 13 disciplines, as measured by the coefficient of variation, and (iv) a better reflection of journal clusters when compared with paper-level citation data. When evaluated against a list of desirable features for maps of science, the updated map is shown to have higher mapping accuracy, easier understandability (because fewer journals are multiply classified), and higher usability for the generation of data overlays, among other improvements.
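
The abstract uses the coefficient of variation to judge how evenly journals are spread over subdisciplines. The following is a minimal sketch of that measure, assuming the population standard deviation divided by the mean; the abstract does not say which variant the authors used, and the journal counts below are hypothetical, chosen only for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Return the population standard deviation divided by the mean."""
    mu = mean(counts)
    if mu == 0:
        raise ValueError("mean is zero; coefficient of variation is undefined")
    return pstdev(counts) / mu


# Hypothetical journal counts per subdiscipline (not taken from the paper).
skewed = [90, 5, 3, 2]
even = [26, 25, 25, 24]

# A lower value indicates a more even distribution of journals.
print(coefficient_of_variation(skewed))
print(coefficient_of_variation(even))
```

Because the measure is normalized by the mean, it can compare the evenness of two classifications even when the total number of journals differs, as it does between the 5-year and 10-year maps.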

Directory of Open Access Journals

PubMed Central

Dépôt Institutionnel Numérique

Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches

Author: AGK Janacek
André Skupin
BC Vanteru
Bob Schijvenaars
Colin Allen
David Newman
DJ Newman
DK Harman
DM Blei
EM Voorhees
EP Jiang
F Janssens
G Gorrell
G Salton
GL Poulter
GR Hjaltason
HM Müller
J Lewis
J Lin
Joseph R. Biberstine
K Börner
K Järvelin
K Sparck Jones
Katy Börner
Kevin W. Boyack
KW Boyack
MA Hearst
MD Cao
Michael Patek
MW Berry
N Jardine
Nianli Ma
NJ Belkin
P Ahlgren
P Calado
P Castells
R Kassab
R Klavans
Richard Klavans
Russell J. Duhon
S Deerwester
S Martin
SE Robertson
T Couto
T Hofmann
T Kohonen
T Theodosiou
TG Kolda
TK Landauer
WS Cooper
Y Aphinyanaphongs
Y Yamamoto
Publication venue: Public Library of Science
Publication date: 01/01/2011

We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications, such as collection management and navigation, summary, and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents. We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts, and subject headings. The nine approaches combined five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models: BM25 and PMRA (PubMed Related Articles). The two data sources were (a) MeSH subject headings and (b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE. PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
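
Two steps named in the abstract, tf-idf cosine similarity and top-n filtering of each document's similarities, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the weighting `tf * log(N / df)`, the helper names, and the four toy documents are all assumptions, and a corpus of two million records would need sparse matrices and approximate search instead of the all-pairs loop shown here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (term -> weight) for tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: c * math.log(n_docs / doc_freq[t])
                        for t, c in term_freq.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbors(vectors, n):
    """Keep only the n highest similarities for each document."""
    result = {}
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))
        result[i] = [(j, s) for s, j in sims[:n]]
    return result


# Toy documents (hypothetical), already split into words.
docs = [
    ["gene", "expression", "cancer"],
    ["gene", "expression", "tumor"],
    ["map", "science", "journal"],
    ["map", "science", "citation"],
]
print(top_n_neighbors(tfidf_vectors(docs), n=1))
```

Filtering to the top n entries per document keeps the similarity matrix sparse, which is what makes a later graph layout or clustering step feasible at this scale.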

Public Library of Science (PLOS)

Crossref

IUScholarWorks (University of Indiana)

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Visualizing the Topical Structure of the Medical Sciences: A Self-Organizing Map Approach

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)
Publication date: 12/03/2013

Background: We implement a high-resolution visualization of the medical knowledge domain using the self-organizing map (SOM) method, based on a corpus of over two million publications. While self-organizing maps have been used for document visualization for some time, (1) little is known about how to deal with truly large document collections in conjunction with a large number of SOM neurons, (2) post-training geometric and semiotic transformations of the SOM tend to be limited, and (3) no user studies have been conducted with domain experts to validate the utility and readability of the resulting visualizations. Our study makes key contributions to all of these issues. Methodology: Documents extracted from Medline and Scopus are analyzed on the basis of indexer-assigned MeSH terms. Initial dimensionality is reduced to include only the top 10% most frequent terms and the resulting document vectors are then used to train a large SOM consisting of over 75,000 neurons. The resulting two-dimensional model of the high-dimensional input space is then transformed into a large-format map by using geographic information system (GIS) techniques and cartographic design principles. This map is then annotated and evaluated by ten experts stemming from the biomedical and other domains. Conclusions: Study results demonstrate that it is possible to transform a very large document corpus into a map that is visually engaging and conceptually stimulating to subject experts from both inside and outside of the particular knowledge domain. The challenges of dealing with a truly large corpus come to the fore and require embracing parallelization and use of supercomputing resources to solve otherwise intractable computational tasks. Among the envisaged future efforts are the creation of a highly interactive interface and the elaboration of the notion of this map of medicine acting as a base map, onto which other knowledge artifacts could be overlaid.
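
The dimensionality reduction described in the methodology, keeping only the top 10% most frequent terms before building document vectors, might look like the sketch below. It assumes that "most frequent" means document frequency and that document vectors are binary term indicators; neither detail is stated in the abstract, and the function names and toy records are hypothetical.

```python
import math
from collections import Counter


def top_fraction_terms(docs, fraction=0.10):
    """Return the most frequent terms, keeping the given fraction of the vocabulary."""
    # Assumption: frequency is counted as the number of documents containing a term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = max(1, math.ceil(len(doc_freq) * fraction))
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:keep]]


def to_vectors(docs, vocabulary):
    """Encode each document as a binary vector over the reduced vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocabulary)
        for term in set(doc):
            if term in index:
                vec[index[term]] = 1
        vectors.append(vec)
    return vectors


# Toy records (hypothetical): each lists the subject terms assigned to one document.
records = [["Humans", "Term%d" % i] for i in range(9)]
vocabulary = top_fraction_terms(records)
print(vocabulary)
print(to_vectors(records, vocabulary)[:3])
```

The reduced vectors would then serve as the training input for the SOM; terms outside the retained vocabulary simply do not contribute a dimension.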

Directory of Open Access Journals

PubMed Central

FigShare

The ten terms occurring in the largest number of contiguous patches.

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)

Terms are ordered according to the number of patches. Also given is the number of neurons over which those patches are distributed.

FigShare

Statistics of label placement for top five term dominance levels.

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)

The first three levels correspond to the blue, red-orange, and green layers in Figure 3 (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0058779#pone-0058779-g003).

FigShare

Top ten terms for four neurons along the transect in Figure 4.

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)

The table includes four neurons: the first neuron along the transect, its immediate neighbor, the pivot neuron, and the final neuron. For each neuron, terms are ranked according to their relative term dominance.

FigShare

Zoomed-out view of the complete map of medical literature, plus detailed views of several regions.

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)

Contents and design as presented to domain experts for qualitative evaluation.

FigShare

Geometric zooming versus semantic zooming.

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)

Juxtaposed are examples of geometric zooming into the static display of multiple levels optimized for preventing label overlaps (top) versus semantic zooming with successive revealing of lower levels of term dominance (bottom).

FigShare

Processing steps for visualizing a large corpus of medical literature based on the self-organizing map method.

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)

The figure also references processing steps taken for the study by Boyack et al. (2011), which was centered on cluster quality.

FigShare

Parallelized batch training of the SOM, with 225 parallel processes.

Author: André Skupin (227241)
Joseph R. Biberstine (151502)
Katy Börner (151476)

Included is only the first of a total of 240 sequential batches, with the trained SOM serving as input to the subsequent batch.
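
One common way to parallelize batch SOM training is to let each process accumulate neighborhood-weighted sums over its own data partition and then combine the sums into a single weight update. The sketch below illustrates that idea only; it is not confirmed to be the scheme behind the 225 processes in the caption. The 1-D neuron grid, the Gaussian neighborhood, and all function names are assumptions, and the partitions are processed in a plain loop for clarity, not in separate processes.

```python
import math


def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda k: sum((w - xi) ** 2 for w, xi in zip(weights[k], x)))


def partial_sums(weights, chunk, sigma):
    """Accumulate neighborhood-weighted sums for one data partition."""
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]
    den = [0.0] * len(weights)
    for x in chunk:
        bmu = best_matching_unit(weights, x)
        for k in range(len(weights)):
            # Gaussian neighborhood on a 1-D neuron grid (an assumption).
            h = math.exp(-((k - bmu) ** 2) / (2 * sigma ** 2))
            den[k] += h
            for d in range(dim):
                num[k][d] += h * x[d]
    return num, den


def batch_update(weights, chunks, sigma=1.0):
    """One batch step: combine per-partition sums into new neuron weights."""
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]
    den = [0.0] * len(weights)
    for chunk in chunks:  # each chunk could be handled by its own process
        c_num, c_den = partial_sums(weights, chunk, sigma)
        for k in range(len(weights)):
            den[k] += c_den[k]
            for d in range(dim):
                num[k][d] += c_num[k][d]
    return [[num[k][d] / den[k] for d in range(dim)] if den[k] > 0
            else list(weights[k]) for k in range(len(weights))]


# Toy data and initial weights (hypothetical).
data = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]]
som = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
for _ in range(3):  # the SOM from one batch is the input to the next
    som = batch_update(som, [data[:2], data[2:]])
print(som)
```

Because the update depends only on sums, splitting the data into partitions gives the same result as a single pass up to floating-point rounding, which is what makes this formulation suitable for many parallel workers.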

FigShare